Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4496 topo serving shards refactor #4631

Merged
merged 40 commits into from
Mar 19, 2019

Conversation

rafael
Copy link
Member

@rafael rafael commented Feb 14, 2019

Description

Things to keep an eye

  • Something that we didn't caught during the design, is that we still need to keep the concept of a tablet control. This allow MigrateServedTypes to first disable query service in sources, enable in destinations and then update the srvKeyspace.
  • @sougou - I didn't remove SourceShards as an element to disable/enable query service in tablets. It was not required for this refactor and thought that it might be better to that as a separate refactor.
  • One part that I don't like, is that there are still some things in the global topo, that could be move down to the cells. Specifically vertical resharding. Once we move that, this refactor will be complete. In the meantime, I feel like we are going in the right direction, but it will be a bit messy.

Pending Tasks

  • Add tests for the newly added functions in srv_keyspace.
  • Add tests for multi shard split in go/vt/wrangler/testlib. I realized there are cases that are covered in integration tests that are not here.
  • Implement some utility functions to manipulate SrvKeyspace by hand. This used to be an option that I think we should keep for emergencies. I still need to do that.
  • Add backwards compatibility for ServedTypes when generating serving keyspace. Moving forward, we are using the isMasterServing attribute in the shard to decide if we should include a shard in the serving keyspace. We need to make sure that this works for legacy topologies that won't have that set.
  • Fix some tests that are still failing on travis.
  • Do not rebuild keyspace if there are tablet control records set. This means that we are in the process of migrate serving types.

Instead of updating the serving keyspace one time per shard, just do it a single
time with all the shards that we need to update.

Signed-off-by: Rafael Chacon <[email protected]>
* Make sure that we get all cells when non is provided in FindAllTablets
* Add function to migrate served types in srv vschema

Signed-off-by: Rafael Chacon <[email protected]>
* Some general code cleanup.
* Discovered bug in cancelMasterMigration. When there are failures updating the
  topology, source shards will ended being assigned as nil. This will cause a
  panic reversing the migration.

Signed-off-by: Rafael Chacon <[email protected]>
Signed-off-by: Rafael Chacon <[email protected]>
Signed-off-by: Rafael Chacon <[email protected]>
Signed-off-by: Rafael Chacon <[email protected]>
VtCombo was not initializing shard or serving keyspace graph. Now that serving
data is derived from here, we need to make sure that this gets called before
gates can serve traffic.

Signed-off-by: Rafael Chacon <[email protected]>
* I think this test was passing by luck because we didn't have any real
  operations that interacted with that cell. Given the global topology refactor
  part of this PR, this has changed and now this breaks.
* I think the regression happened when samuel/go-zookeeper was introduced, but
  this was never discovered. This test was originally introduced in:
  vitessio@c685aaf

Signed-off-by: Rafael Chacon <[email protected]>
Signed-off-by: Rafael Chacon <[email protected]>
@rafael rafael requested a review from sougou as a code owner February 14, 2019 21:28
@rafael rafael force-pushed the 4496-topo-serving-shards-refactor branch from ac08bec to d8afcf0 Compare February 17, 2019 01:03
Signed-off-by: Rafael Chacon <[email protected]>
@rafael rafael changed the title WIP - 4496 topo serving shards refactor 4496 topo serving shards refactor Mar 9, 2019
@@ -115,6 +124,403 @@ func (ts *Server) GetSrvKeyspaceNames(ctx context.Context, cell string) ([]strin
}
}

// GetShardServingCells returns cells where this shard is serving
func (ts *Server) GetShardServingCells(ctx context.Context, si *ShardInfo) (servingCells []string, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function should not exist. We should instead just use all cells without assuming anything. And then deal with absence of ShardReference only if it's actually relevant.

Also, I think a ShardReference is supposed to be there in all cells, right? The only time it can be missing is if someone didn't do a Rebuild after creating a new shard.

In any case, it's better to go through all cells for these things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. I'll update accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sugu - I was looking into this a bit more and I was thinking that it might be better do this in a future iteration.

There are some parts where I will need to make some slight modifications. For instance:

	servingCells, err := wr.ts.GetShardServingCells(ctx, shardInfo)
	if err != nil {
		return err
	}
	// Check the Serving map for the shard, we don't want to
	// remove a serving shard if not absolutely sure.
	if !evenIfServing && len(servingCells) > 0 {
		return fmt.Errorf("shard %v/%v is still serving, cannot delete it, use even_if_serving flag if needed", keyspace, shard)
	}

or

func (wr *Wrangler) RemoveShardCell(ctx context.Context, keyspace, shard, cell string, force, recursive bool) error {
	shardInfo, err := wr.ts.GetShard(ctx, keyspace, shard)
	if err != nil {
		return err
	}

	shardServingCells, err := wr.ts.GetShardServingCells(ctx, shardInfo)

	// check the cell is in the list already
	if !topo.InCellList(cell, shardServingCells) {
		return fmt.Errorf("cell %v in not in shard info", cell)
	}

It's not really big, but given that this PR is already huge. Thinking that making this change as a separate PR makes sense.

go/vt/topo/srv_keyspace.go Show resolved Hide resolved
go/vt/topo/srv_keyspace.go Outdated Show resolved Hide resolved
go/vt/topo/srv_keyspace.go Outdated Show resolved Hide resolved
go/vt/topo/srv_keyspace.go Outdated Show resolved Hide resolved
Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

That Add/Delete split didn't get that much better, but I think the call sites read better. So, I guess it's a net win.

@sougou sougou merged commit b7165dc into vitessio:master Mar 19, 2019
@rafael rafael deleted the 4496-topo-serving-shards-refactor branch March 19, 2019 04:53
Copy link
Member

@dweitzman dweitzman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slowly looking through parts of this diff to have a better understanding of what'll change when we merge it. Sending a few comments/questions.

go/vt/topotools/rebuild_keyspace.go Show resolved Hide resolved
// We rebuild keyspace iff:
// 1) shard master is in a serving state.
// 2) shard has served type for master (this is for backwards compatibility).
if !(si.IsMasterServing || si.GetServedType(topodatapb.TabletType_MASTER) != nil) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately clear to me why it wouldn't be necessarily anymore to update partition information for replica and rdonly served types for a shard even if the master isn't serving from that shard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This is one of the core parts of this change. @sougou and I realized that if a master is serving, from a logical perspective that's the only thing you need to know to generate the SrvKeyspaceGraph. You could be in one of these scenarios:
  1. User creates a new keyspace with new shards. In init tablet, because there are no overlapping shards IsMasterServing will be set to true. When you build the keyspace graph for this newly created keyspace, this code will generate the serving graph for all types: rdonly, replica, master.
  2. If you are creating shards in an existent keyspaces (to prepare for a split), InitTablet will IsMasterServing to false, and rebuild keyspace won't take those new shards into consideration for any type.

When you are in the process of migrating serving types, there is code there that will modify the serving keyspace graph to add/remove new shards accordingly. But that does not go through this code path.

go/vt/vttablet/tabletmanager/init_tablet.go Show resolved Hide resolved
go/vt/wrangler/keyspace.go Show resolved Hide resolved
_, err = wr.ts.UpdateShardFields(ctx, keyspace, shard, func(si *topo.ShardInfo) error {
return si.UpdateServedTypesMap(servedType, cells, remove)
})
// TODO: What should we do with this method?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like SetShardServedTypes() should probably be deleted along with the vtctl command.

If not, this method probably at least raise an error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants